Word Association Norms, Mutual Information and Lexicography
Authors
Abstract
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word "nurse" if it follows a highly associated word such as "doctor.") We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). This paper will propose a new objective measure based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.

1. Meaning and Association

It is common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their co-occurrence with other words. Running through the whole Firthian tradition, for example, is the theme that "You shall know a word by the company it keeps" (Firth, 1957).

"On the one hand, bank co-occurs with words and expressions such as money, notes, loan, account, investment, clerk, official, manager, robbery, vaults, working in a, its actions, First National, of England, and so forth. On the other hand, we find bank co-occurring with river, swim, boat, east (and of course West and South, which have acquired special meanings of their own), on top of the, and of the Rhine." [Hanks (1987), p. 127]

The search for increasingly delicate word classes is not new.
In lexicography, for example, it goes back at least to the "verb patterns" described in Hornby's Advanced Learner's Dictionary (first edition 1948). What is new is that facilities for the computational storage and analysis of large bodies of natural language have developed significantly in recent years, so that it is now becoming possible to test and apply informal assertions of this kind in a more rigorous way, and to see what company our words do keep.

2. Practical Applications

The proposed statistical description has a large number of potentially important applications, including: (a) constraining the language model both for speech recognition and optical character recognition (OCR), (b) providing disambiguation cues for parsing highly ambiguous syntactic structures such as noun compounds, conjunctions, and prepositional phrases, (c) retrieving texts from large databases (e.g., newspapers, patents), (d) enhancing the productivity of computational linguists in compiling lexicons of lexico-syntactic facts, and (e) enhancing the productivity of lexicographers in identifying normal and conventional usage.

Consider the optical character recognition (OCR) application. Suppose that we have an OCR device such as [Kahan, Pavlidis, Baird (1987)], and it has assigned about equal probability to having recognized "farm" and "form," where the context is either: (1) "federal ____ credit" or (2) "some ____ of." The proposed association measure can make use of the fact that "farm" is much more likely in the first context and "form" is much more likely in the second to resolve the ambiguity. Note that alternative disambiguation methods based on syntactic constraints such as part of speech are unlikely to help in this case, since both "form" and "farm" are commonly used as nouns.

3. Word Association and Psycholinguistics

Word association norms are well known to be an important factor in psycholinguistic research, especially in the area of lexical retrieval.
Generally speaking, subjects respond quicker than normal to the word "nurse" if it follows a highly associated word such as "doctor."

"Some results and implications are summarized from reaction-time experiments in which subjects either (a) classified successive strings of letters as words and nonwords, or (b) pronounced the strings. Both types of response to words (e.g., BUTTER) were consistently faster when preceded by associated words (e.g., BREAD) rather than unassociated words (e.g., NURSE)." [Meyer, Schvaneveldt and Ruddy (1975), p. 98]

Much of this psycholinguistic research is based on empirical estimates of word association norms such as [Palermo and Jenkins (1964)], perhaps the most influential study of its kind, though extremely small and somewhat dated. This study measured 200 words by asking a few thousand subjects to write down a word after each of the 200 words to be measured. Results are reported in tabular form, indicating which words were written down, and by how many subjects, factored by grade level and sex. The word "doctor," for example, is reported on pp. 98-100 to be most often associated with "nurse," followed by "sick," "health," "medicine," "hospital," "man," "sickness," "lawyer," and about 70 more words.

4. An Information Theoretic Measure

We propose an alternative measure, the association ratio, for measuring word association norms, based on the information theoretic concept of mutual information. The proposed measure is more objective and less costly than the subjective method employed in [Palermo and Jenkins (1964)]. The association ratio can be scaled up to provide robust estimates of word association norms for a large portion of the language. Using the association ratio measure, the five words most associated with "doctor" are (in order): "dentists," "nurses," "treating," "treat," and "hospitals."

What is "mutual information"? According to [Fano (1961), p.
28], if two points (words), x and y, have probabilities P(x) and P(y), then their mutual information, I(x,y), is defined to be

    I(x,y) = log2 [ P(x,y) / ( P(x) P(y) ) ]

Informally, mutual information compares the probability of observing x and y together (the joint probability) with the probabilities of observing x and y independently (chance). If there is a genuine association between x and y, then the joint probability P(x,y) will be much larger than chance P(x) P(y), and consequently I(x,y) >> 0. If there is no interesting relationship between x and y, then P(x,y) ≈ P(x) P(y), and thus I(x,y) ≈ 0. If x and y are in complementary distribution, then P(x,y) will be much less than P(x) P(y), forcing I(x,y) << 0.

In our application, word probabilities, P(x) and P(y), are estimated by counting the number of observations of x and y in a corpus, f(x) and f(y), and normalizing by N, the size of the corpus. (Our examples use a number of different corpora with different sizes: 15 million words for the 1987 AP corpus, 36 million words for the 1988 AP corpus, and 8.6 million tokens for the tagged corpus.) Joint probabilities, P(x,y), are estimated by counting the number of times that x is followed by y in a window of w words, fw(x,y), and normalizing by N.

The window size parameter allows us to look at different scales. Smaller window sizes will identify fixed expressions (idioms) and other relations that hold over short ranges; larger window sizes will highlight semantic concepts and other relationships that hold over larger scales. For the remainder of this paper, the window size, w, will be set to 5 words as a compromise; this setting is large enough to show some of the constraints between verbs and arguments, but not so large that it would wash out constraints that make use of strict adjacency.[1] Since the association ratio becomes unstable when the counts are very small, we will not discuss word pairs with f(x,y) <= 5.
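As a rough illustration of the estimation described above, the following Python sketch computes association ratios from a token list using ordered window counts fw(x,y) and the f(x,y) <= 5 cutoff. The function name is ours, and reading "followed by y in a window of w words" as "y is among the w tokens after x" is an assumption, not the authors' implementation.

```python
from collections import Counter
from math import log2

def association_ratios(tokens, w=5, min_count=6):
    """Estimate I(x, y) = log2( P(x, y) / (P(x) P(y)) ) from a token list.

    P(x) = f(x)/N and P(x, y) = fw(x, y)/N, where fw(x, y) counts how
    often x is followed by y within a window of w words (order matters).
    """
    N = len(tokens)
    f = Counter(tokens)
    fw = Counter()
    for i, x in enumerate(tokens):
        # One reading of "x is followed by y in a window of w words":
        # y is among the w tokens after x.
        for y in tokens[i + 1 : i + 1 + w]:
            fw[(x, y)] += 1
    return {
        (x, y): log2((fxy / N) / ((f[x] / N) * (f[y] / N)))
        for (x, y), fxy in fw.items()
        if fxy >= min_count  # ignore unstable pairs with f(x,y) <= 5
    }
```

On a toy corpus in which "strong" is always followed by "tea," the pair ("strong", "tea") comes out with a large positive ratio, while pairs that co-occur no more often than the threshold allows are dropped.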
An improvement would make use of t-scores, and throw out pairs that were not significant. Unfortunately, this requires an estimate of the variance of f(x,y), which goes beyond the scope of this paper. For the remainder of this paper, we will adopt the simple but arbitrary threshold, and ignore pairs with small counts.

Technically, the association ratio is different from mutual information in two respects. First, joint probabilities are supposed to be symmetric: P(x,y) = P(y,x), and thus mutual information is also symmetric: I(x,y) = I(y,x). However, the association ratio is not symmetric, since f(x,y) encodes linear precedence. (Recall that f(x,y) denotes the number of times that word x appears before y in the window of w words, not the number of times the two words appear in either order.) Although we could fix this problem by redefining f(x,y) to be symmetric (by averaging the matrix with its transpose), we have decided not to do so, since order information appears to be very interesting. Notice the asymmetry in the pairs below (computed from 36 million words of 1988 AP text), illustrating a wide variety of biases ranging from sexism to syntax.

[1] This definition of fw(x,y) uses a rectangular window. It might be interesting to consider alternatives (e.g., a triangular window or a decaying exponential) that would weight words less and less as they are separated by more and more words.

[Table: Asymmetry in 1988 AP Corpus (N = 36 million); table body not recovered from the source]
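The order sensitivity of f(x,y) is easy to demonstrate: counting x-before-y separately from y-before-x yields a matrix that is not symmetric. A minimal sketch (the function name and the toy sentence are ours, for illustration only):

```python
from collections import Counter

def ordered_window_counts(tokens, w=5):
    """fw(x, y): how often x appears before y within a window of w words.

    Because order is preserved, fw[(x, y)] and fw[(y, x)] generally
    differ; averaging the matrix with its transpose would symmetrize it.
    """
    fw = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + w]:
            fw[(x, y)] += 1
    return fw

tokens = "the doctors and nurses treat patients while doctors train nurses".split()
fw = ordered_window_counts(tokens)
print(fw[("doctors", "nurses")])  # 2: "doctors" precedes "nurses" twice
print(fw[("nurses", "doctors")])  # 1: "nurses" precedes "doctors" once
```

Averaging fw with its transpose would recover the symmetric quantity that mutual information formally requires, at the cost of the precedence information the paper argues is worth keeping.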
Similar Articles
Book Review: A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks edited by Gilles-Maurice de Schryver
In his introduction to this collection of articles dedicated to Patrick Hanks, de Schryver presents a quote from Atkins referring to Hanks as “the ideal lexicographer’s lexicographer.” Indeed, Hanks has had a formidable career in lexicography, including playing an editorial role in the production of four major English dictionaries. But Hanks’s achievements reach far beyond lexicography; in part...
Exploring the Chinese Mental Lexicon with Word Association Norms
Our internal repository of words, often known as the mental lexicon, has primarily been modelled by psychologists as some kind of network. One way to probe its organisation and access mechanisms is by means of word association techniques, which have rarely been applied to Chinese. This paper reports on the design and implementation of a pilot word association test on native Hong Kong Cantonese ...
The Application of Fuzzy Logic to Collocation Extraction
Collocations are important for many tasks of natural language processing, such as information retrieval, machine translation, computational lexicography, etc. So far, many statistical methods have been used for collocation extraction. Almost all of these methods form a classical crisp set of collocations. We propose a fuzzy logic approach to collocation extraction to form a fuzzy set of collocations in ...
Automatic Extraction of English Collocations and their Chinese-English Bilingual Examples: A Computational Tool for Bilingual Lexicography
This paper describes the procedures involved in developing EXEC, a web-based system which can automatically extract English collocations and their Chinese-English bilingual examples from parallel corpora. The system draws on statistics, dependency parsing, and Chinese-English parallel corpora of more than 13 million English words and 27 million Chinese characters. By taking a word as well as th...
The World Within Wikipedia: An Ecology of Mind
Human beings inherit an informational culture transmitted through spoken and written language. A growing body of empirical work supports the mutual influence between language and categorization, suggesting that our cognitive-linguistic environment both reflects and shapes our understanding. By implication, artifacts that manifest this cognitive-linguistic environment, such as Wikipedia, should ...